Neural Network


Roadmap

(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models

Lecture 11: Gradient Boosted Decision Tree
aggregating trees from functional gradient and steepest descent subject to any error measure

(3) Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
• Motivation
• Neural Network Hypothesis
• Neural Network Learning
• Optimization and Regularization
Neural Network Motivation


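Linear Aggregation of Perceptrons: Pictorial View

$$G(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t \underbrace{\mathrm{sign}\left(\mathbf{w}_t^T \mathbf{x}\right)}_{g_t(\mathbf{x})}\right)$$

• two layers of weights: $\mathbf{w}_t$ and $\boldsymbol{\alpha}$
• two layers of sign functions: in $g_t$ and in $G$

what boundary can $G$ implement?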


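Logic Operations with Aggregation

[Figure: the regions of $g_1$ and $g_2$ in the $\mathbf{x}$ plane, and the region of $\mathrm{AND}(g_1, g_2)$]

$$G(\mathbf{x}) = \mathrm{sign}\left(-1 + g_1(\mathbf{x}) + g_2(\mathbf{x})\right)$$

• $g_1(\mathbf{x}) = g_2(\mathbf{x}) = +1$ (TRUE): $G(\mathbf{x}) = +1$ (TRUE)
• otherwise: $G(\mathbf{x}) = -1$ (FALSE)
• $G \equiv \mathrm{AND}(g_1, g_2)$

OR, NOT can be similarly implemented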
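The logic gates above can be sketched in a few lines of Python. The AND coefficients come from the slide; the OR and NOT coefficients are assumed here as one valid choice each, and the helper names `sign` and `aggregate` are hypothetical.

```python
def sign(s):
    """Return +1 for a positive score and -1 otherwise."""
    return 1 if s > 0 else -1

def aggregate(alphas, gs):
    """G = sign(alpha_0 * g_0 + alpha_1 * g_1 + ...), with g_0 = +1 as the constant input."""
    return sign(sum(a * g for a, g in zip(alphas, [1] + list(gs))))

def AND(g1, g2):
    return aggregate((-1, 1, 1), (g1, g2))   # coefficients from the slide

def OR(g1, g2):
    return aggregate((1, 1, 1), (g1, g2))    # assumed coefficients (one valid choice)

def NOT(g1):
    return aggregate((0, -1), (g1,))         # assumed coefficients (one valid choice)

# enumerate all four input combinations
for g1 in (-1, 1):
    for g2 in (-1, 1):
        print(g1, g2, "AND:", AND(g1, g2), "OR:", OR(g1, g2))
```

None of the scores in these truth tables is exactly zero, so the convention chosen for `sign(0)` does not matter here.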


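Powerfulness and Limitation

[Figure: boundaries from aggregating 8 perceptrons and 16 perceptrons, next to the target boundary]

• ‘convex set’ hypotheses implemented: $d_{\mathrm{VC}} \to \infty$, remember? :-)
• powerfulness: enough perceptrons $\approx$ smooth boundary

[Figure: $g_1$, $g_2$, and $\mathrm{XOR}(g_1, g_2)$]

• limitation: XOR not ‘linear separable’ under $\phi(\mathbf{x}) = (g_1(\mathbf{x}), g_2(\mathbf{x}))$

how to implement $\mathrm{XOR}(g_1, g_2)$?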


Multi-Layer Perceptrons: Basic Neural Network

• non-separable data: can use more transform
• how about one more layer of AND transform?

$$\mathrm{XOR}(g_1, g_2) = \mathrm{OR}\left(\mathrm{AND}(-g_1, g_2), \mathrm{AND}(g_1, -g_2)\right)$$

[Figure: two-layer network with $G \equiv \mathrm{XOR}(g_1, g_2)$]

perceptron (simple)
⟹ aggregation of perceptrons (powerful)
⟹ multi-layer perceptrons (more powerful)
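As a minimal sketch of the two-layer construction, the XOR formula can be written with one layer of AND units followed by an OR unit. The function names are hypothetical, and the OR coefficients $(+1, +1, +1)$ are an assumed valid choice.

```python
def sign(s):
    """Return +1 for a positive score and -1 otherwise."""
    return 1 if s > 0 else -1

def AND(a, b):
    return sign(-1 + a + b)

def OR(a, b):
    return sign(+1 + a + b)   # assumed coefficients (one valid choice)

def XOR(g1, g2):
    # first layer: two AND units on (possibly negated) inputs
    h1 = AND(-g1, g2)
    h2 = AND(g1, -g2)
    # second layer: OR over the two hidden units
    return OR(h1, h2)

truth_table = {(g1, g2): XOR(g1, g2) for g1 in (-1, 1) for g2 in (-1, 1)}
print(truth_table)
```

A single layer of aggregation cannot produce this table, because XOR is not linearly separable in $(g_1, g_2)$; the extra AND layer is what makes it work.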


Connection to Biological Neurons

[Image: by UC Regents Davis campus-brainmaps.org. Licensed under CC BY 3.0 via Wikimedia Commons]
[Image: by Lauris Rubenis. Licensed under CC BY 2.0 via https://flic.kr/p/fkVuZX]
[Image: by Pedro Ribeiro Simões. Licensed under CC BY 2.0 via https://flic.kr/p/adiv7b]

neural network: bio-inspired model

Fun Time

Let $g_0(\mathbf{x}) = +1$. Which of the following $(\alpha_0, \alpha_1, \alpha_2)$ allows $G(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=0}^{2} \alpha_t g_t(\mathbf{x})\right)$ to implement $\mathrm{OR}(g_1, g_2)$?

(1) $(-3, +1, +1)$
(2) $(-1, +1, +1)$
(3) $(+1, +1, +1)$
(4) $(+3, +1, +1)$
Reference Answer: (3)
You can easily verify with all four possibilities of $(g_1(\mathbf{x}), g_2(\mathbf{x}))$.
Neural Network Neural Network Hypothesis


Neural Network Hypothesis: Output

• OUTPUT: simply a linear model with $s = \mathbf{w}^T \phi^{(2)}(\phi^{(1)}(\mathbf{x}))$
• any linear model can be used, remember? :-)

linear classification: $h(\mathbf{x}) = \mathrm{sign}(s)$, err = 0/1
linear regression: $h(\mathbf{x}) = s$, err = squared
logistic regression: $h(\mathbf{x}) = \theta(s)$, err = cross-entropy

will discuss ‘regression’ with squared error


Neural Network Hypothesis: Transformation

• transformation function of score (signal) $s$
• any transformation?
  • linear: whole network linear & thus less useful
  • sign (step): discrete & thus hard to optimize for $\mathbf{w}$
• popular choice of transformation: $\tanh(s)$
  • ‘analog’ approximation of sign: easier to optimize
  • somewhat closer to biological neuron
  • not that new! :-)

$$\tanh(s) = \frac{\exp(s) - \exp(-s)}{\exp(s) + \exp(-s)} = 2\theta(2s) - 1$$

will discuss with tanh as transformation function
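The identity between tanh and the logistic function $\theta$ can be checked numerically. The sketch below assumes $\theta(s) = 1 / (1 + \exp(-s))$, the usual logistic function; the helper names are hypothetical.

```python
import math

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def tanh_from_exp(s):
    """tanh written out with exponentials, as in the formula above."""
    return (math.exp(s) - math.exp(-s)) / (math.exp(s) + math.exp(-s))

def tanh_from_theta(s):
    """tanh as a scaled and shifted logistic function."""
    return 2.0 * theta(2.0 * s) - 1.0

for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"s={s:+.1f}  math.tanh={math.tanh(s):+.6f}  "
          f"exp form={tanh_from_exp(s):+.6f}  theta form={tanh_from_theta(s):+.6f}")
```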


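Neural Network Hypothesis

weights $w_{ij}^{(\ell)}$:
• $1 \le \ell \le L$ layers
• $0 \le i \le d^{(\ell-1)}$ inputs
• $1 \le j \le d^{(\ell)}$ outputs

score $s_j^{(\ell)} = \sum_{i=0}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)}$, transformed $x_j^{(\ell)} = \tanh\left(s_j^{(\ell)}\right)$ if $\ell < L$ and $x_j^{(\ell)} = s_j^{(\ell)}$ if $\ell = L$

apply $\mathbf{x}$ as input layer $\mathbf{x}^{(0)}$, go through hidden layers to get $\mathbf{x}^{(\ell)}$, predict at output layer $x_1^{(L)}$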
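A minimal sketch of this forward computation in plain Python follows. It assumes tanh on hidden layers and a linear output layer (the regression setting), and stores each layer as a nested list where row 0 holds the weights for the constant input $x_0 = 1$. The function name and the example weights are hypothetical.

```python
import math

def forward(weights, x):
    """Compute the NNet output for input x.

    weights[l][i][j] plays the role of w_ij^(l+1); row i = 0 multiplies x_0 = 1.
    """
    layer_in = list(x)
    n_layers = len(weights)
    for l, W in enumerate(weights):
        xs = [1.0] + layer_in                        # prepend the constant x_0 = 1
        scores = [sum(W[i][j] * xs[i] for i in range(len(xs)))
                  for j in range(len(W[0]))]         # s_j = sum_i w_ij x_i
        is_output = (l == n_layers - 1)
        # hidden layers apply tanh; the output layer stays linear
        layer_in = scores if is_output else [math.tanh(s) for s in scores]
    return layer_in

# a hypothetical 2-2-1 network with hand-picked weights
W1 = [[0.0, 0.0],
      [1.0, -1.0],
      [1.0, 1.0]]
W2 = [[0.5], [1.0], [-1.0]]
print(forward([W1, W2], [0.0, 0.0]))
print(forward([W1, W2], [1.0, 0.0]))
```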


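Physical Interpretation

• each layer: transformation to be learned from data
• $\phi^{(\ell)}(\mathbf{x}) = \tanh\left(\begin{bmatrix} \sum_{i=0}^{d^{(\ell-1)}} w_{i1}^{(\ell)} x_i^{(\ell-1)} \\ \vdots \end{bmatrix}\right)$: whether $\mathbf{x}$ ‘matches’ weight vectors in pattern

NNet: pattern extraction with layers of connection weights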


Fun Time

How many weights $\{w_{ij}^{(\ell)}\}$ are there in a 3-5-1 NNet?

(1) 9
(2) 15
(3) 20
(4) 26
Reference Answer: (4)
There are $(3 + 1) \times 5$ weights in $w_{ij}^{(1)}$, and $(5 + 1) \times 1$ weights in $w_{ij}^{(2)}$.
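The same count generalizes to any $d^{(0)}$-$d^{(1)}$-$\cdots$-$d^{(L)}$ network: each layer contributes (inputs + 1) × outputs weights, where the +1 is the constant input. A small sketch, with a hypothetical helper name:

```python
def count_weights(layer_sizes):
    """Number of weights in a d0-d1-...-dL NNet, counting one constant input per layer."""
    return sum((d_in + 1) * d_out
               for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_weights([3, 5, 1]))  # (3 + 1) * 5 + (5 + 1) * 1 = 26
```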


Neural Network Neural Network Learning 


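How to Learn the Weights?

• goal: learning all $\{w_{ij}^{(\ell)}\}$ to minimize $E_{\mathrm{in}}\left(\{w_{ij}^{(\ell)}\}\right)$
• one hidden layer: simply aggregation of perceptrons; gradient boosting to determine hidden neuron one by one
• multiple hidden layers? not easy
• let $e_n = (y_n - \mathrm{NNet}(\mathbf{x}_n))^2$: can apply (stochastic) GD after computing $\frac{\partial e_n}{\partial w_{ij}^{(\ell)}}$

next: efficient computation of $\frac{\partial e_n}{\partial w_{ij}^{(\ell)}}$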





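Computing $\frac{\partial e_n}{\partial w_{i1}^{(L)}}$ (Output Layer)

$$e_n = (y_n - \mathrm{NNet}(\mathbf{x}_n))^2 = \left(y_n - s_1^{(L)}\right)^2 = \left(y_n - \sum_{i=0}^{d^{(L-1)}} w_{i1}^{(L)} x_i^{(L-1)}\right)^2$$

specially (output layer), for $0 \le i \le d^{(L-1)}$:

$$\frac{\partial e_n}{\partial w_{i1}^{(L)}} = \frac{\partial e_n}{\partial s_1^{(L)}} \cdot \frac{\partial s_1^{(L)}}{\partial w_{i1}^{(L)}} = -2\left(y_n - s_1^{(L)}\right) \cdot \left(x_i^{(L-1)}\right)$$

generally ($1 \le \ell < L$), for $0 \le i \le d^{(\ell-1)}$ and $1 \le j \le d^{(\ell)}$:

$$\frac{\partial e_n}{\partial w_{ij}^{(\ell)}} = \frac{\partial e_n}{\partial s_j^{(\ell)}} \cdot \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} \cdot \left(x_i^{(\ell-1)}\right)$$

$\delta_1^{(L)} = -2\left(y_n - s_1^{(L)}\right)$, how about others?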


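Computing $\delta_j^{(\ell)} = \frac{\partial e_n}{\partial s_j^{(\ell)}}$

$$s_j^{(\ell)} \overset{\tanh}{\Longrightarrow} x_j^{(\ell)} \overset{w_{jk}^{(\ell+1)}}{\Longrightarrow} \begin{bmatrix} s_1^{(\ell+1)} \\ \vdots \\ s_k^{(\ell+1)} \\ \vdots \end{bmatrix} \Longrightarrow \cdots \Longrightarrow e_n$$

$$\delta_j^{(\ell)} = \frac{\partial e_n}{\partial s_j^{(\ell)}} = \sum_{k=1}^{d^{(\ell+1)}} \frac{\partial e_n}{\partial s_k^{(\ell+1)}} \cdot \frac{\partial s_k^{(\ell+1)}}{\partial x_j^{(\ell)}} \cdot \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} = \sum_{k} \left(\delta_k^{(\ell+1)}\right) \left(w_{jk}^{(\ell+1)}\right) \left(\tanh'\left(s_j^{(\ell)}\right)\right)$$

$\delta_j^{(\ell)}$ can be computed backwards from $\delta_k^{(\ell+1)}$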


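Backpropagation (Backprop) Algorithm

Backprop on NNet
initialize all weights $w_{ij}^{(\ell)}$
for $t = 0, 1, \ldots, T$
  (1) stochastic: randomly pick $n \in \{1, 2, \cdots, N\}$
  (2) forward: compute all $x_i^{(\ell)}$ with $\mathbf{x}^{(0)} = \mathbf{x}_n$
  (3) backward: compute all $\delta_j^{(\ell)}$ subject to $\mathbf{x}^{(0)} = \mathbf{x}_n$
  (4) gradient descent: $w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta\, x_i^{(\ell-1)} \delta_j^{(\ell)}$
return $g_{\mathrm{NNET}}(\mathbf{x}) = \left(\cdots \tanh\left(\sum_j w_{jk}^{(2)} \cdot \tanh\left(\sum_i w_{ij}^{(1)} x_i\right)\right)\right)$

sometimes (1) to (3) is done many times (in parallel) and average($x_i^{(\ell-1)} \delta_j^{(\ell)}$) taken for update in (4), called mini-batch

basic NNet algorithm: backprop to compute the gradient efficiently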
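The four steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: hidden layers use tanh, the single output is linear with squared error, and the names (`init_weights`, `forward`, `backward`, `sgd_step`, `train`), the initialization scale, the learning rate, and the toy data are all hypothetical.

```python
import math
import random

def init_weights(sizes, rng, scale=0.1):
    """weights[l][i][j]: weight from input i (i = 0 is the constant 1) to unit j of layer l+1."""
    return [[[rng.uniform(-scale, scale) for _ in range(d_out)]
             for _ in range(d_in + 1)]
            for d_in, d_out in zip(sizes[:-1], sizes[1:])]

def forward(weights, x):
    """Return each layer's inputs (constant 1 prepended) and each layer's scores."""
    xs, ss = [], []
    cur = list(x)
    for l, W in enumerate(weights):
        xin = [1.0] + cur
        s = [sum(W[i][j] * xin[i] for i in range(len(xin))) for j in range(len(W[0]))]
        xs.append(xin)
        ss.append(s)
        cur = s if l == len(weights) - 1 else [math.tanh(v) for v in s]
    return xs, ss

def backward(weights, ss, y):
    """Return deltas[l][j] = d e_n / d s_j, computed from the output layer backwards."""
    L = len(weights)
    deltas = [None] * L
    deltas[L - 1] = [-2.0 * (y - ss[L - 1][0])]          # output-layer delta
    for l in range(L - 2, -1, -1):
        W_next = weights[l + 1]
        deltas[l] = [
            # row j + 1 of W_next holds the weights leaving hidden unit j (row 0 is the bias)
            sum(deltas[l + 1][k] * W_next[j + 1][k] for k in range(len(W_next[0])))
            * (1.0 - math.tanh(ss[l][j]) ** 2)           # tanh'(s) = 1 - tanh(s)^2
            for j in range(len(ss[l]))
        ]
    return deltas

def sgd_step(weights, x, y, eta):
    """One stochastic update: w_ij <- w_ij - eta * x_i * delta_j for every layer."""
    xs, ss = forward(weights, x)
    deltas = backward(weights, ss, y)
    for l, W in enumerate(weights):
        for i in range(len(W)):
            for j in range(len(W[0])):
                W[i][j] -= eta * xs[l][i] * deltas[l][j]

def predict(weights, x):
    return forward(weights, x)[1][-1][0]

def train(data, sizes, eta=0.05, T=5000, seed=0):
    rng = random.Random(seed)
    weights = init_weights(sizes, rng)
    for _ in range(T):
        x, y = rng.choice(data)                          # step (1): pick one example
        sgd_step(weights, x, y, eta)                     # steps (2) to (4)
    return weights

# hypothetical toy regression data
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 2.0)]
weights = train(data, sizes=[2, 3, 1])
print([round(predict(weights, x), 3) for x, _ in data])
```

A common sanity check for code like this is to compare `xs[l][i] * deltas[l][j]` against a finite-difference estimate of the derivative of $e_n$ with respect to the same weight.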








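Fun Time

According to $\frac{\partial e_n}{\partial w_{i1}^{(L)}} = -2\left(y_n - s_1^{(L)}\right) \cdot \left(x_i^{(L-1)}\right)$, when would $\frac{\partial e_n}{\partial w_{i1}^{(L)}} = 0$?

(1) $y_n = s_1^{(L)}$
(2) $x_i^{(L-1)} = 0$
(3) $s_i^{(L-1)} = 0$
(4) all of the above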
Reference Answer: (4)
Note that $x_i^{(L-1)} = \tanh\left(s_i^{(L-1)}\right) = 0$ if and only if $s_i^{(L-1)} = 0$.


Neural Network Optimization and Regularization 


Neural Network Optimization

$$E_{\mathrm{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\left(\left(\cdots \tanh\left(\sum_j w_{jk}^{(2)} \cdot \tanh\left(\sum_i w_{ij}^{(1)} x_{n,i}\right)\right)\right), y_n\right)$$

• generally non-convex when multiple hidden layers
  • not easy to reach global minimum
  • GD/SGD with backprop only gives local minimum
• different initial $w_{ij}^{(\ell)}$ ⟹ different local minimum
  • somewhat ‘sensitive’ to initial weights
  • large weights ⟹ saturate (small gradient)
  • advice: try some random & small ones

NNet: difficult to optimize, but practically works
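The saturation effect can be seen from $\tanh'(s) = 1 - \tanh^2(s)$, the factor that multiplies every $\delta_j^{(\ell)}$ in backprop. A minimal sketch, with a single hypothetical input value and a few hypothetical weight magnitudes:

```python
import math

def tanh_grad(s):
    """Derivative of tanh at score s: 1 - tanh(s)^2."""
    return 1.0 - math.tanh(s) ** 2

x = 1.0  # a single hypothetical input value
for w in (0.1, 1.0, 5.0, 20.0):
    score = w * x
    print(f"w={w:5.1f}  score={score:5.1f}  tanh'(score)={tanh_grad(score):.3e}")
```

With a large weight the score lands on the flat part of tanh, so the gradient that reaches earlier layers is close to zero. This is one reason to start from small random weights.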


VC Dimension of Neural Network Model

roughly, with tanh-like transfer functions: $d_{\mathrm{VC}} = O(VD)$ where $V$ = # of neurons, $D$ = # of weights

• pros: can approximate ‘anything’ if enough neurons ($V$ large)
• cons: can overfit if too many neurons

NNet: watch out for overfitting!


Regularization for Neural Network

basic choice: old friend weight-decay (L2) regularizer $\Omega(\mathbf{w}) = \sum \left(w_{ij}^{(\ell)}\right)^2$

• ‘shrink’ weights: large weight → large shrink; small weight → small shrink
• want $w_{ij}^{(\ell)} = 0$ (sparse) to effectively decrease $d_{\mathrm{VC}}$
  • L1 regularizer: $\sum \left|w_{ij}^{(\ell)}\right|$, but not differentiable
  • weight-elimination (‘scaled’ L2) regularizer: large weight → median shrink; small weight → median shrink

weight-elimination regularizer: $\sum \frac{\left(w_{ij}^{(\ell)}\right)^2}{1 + \left(w_{ij}^{(\ell)}\right)^2}$


Yet Another Regularization: Early Stopping

• GD/SGD (backprop) visits more weight combinations as $t$ increases
• smaller $t$ effectively decreases $d_{\mathrm{VC}}$
• better ‘stop in middle’: early stopping

[Figure: error versus VC dimension $d_{\mathrm{VC}}$, with curves for in-sample error, model complexity, and out-of-sample error ($d_{\mathrm{VC}}^*$ in middle, remember? :-))]
[Figure: error versus iteration $t$ on a log scale]

when to stop? validation!
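Validation-based early stopping can be sketched as below. To keep the example short and checkable, a one-weight linear model trained by plain gradient descent stands in for the NNet; the function name, the `patience` rule, and the tiny data sets are all assumptions made for illustration.

```python
def fit_with_early_stopping(train, val, eta=0.1, max_iters=1000, patience=20):
    """Gradient descent on squared error for y ~ w * x, keeping the w with lowest validation error.

    Training stops once `patience` iterations pass without a new best validation error.
    """
    def mean_sq_err(data, w):
        return sum((y - w * x) ** 2 for x, y in data) / len(data)

    w = 0.0
    best_w, best_t, best_err = w, 0, mean_sq_err(val, w)
    for t in range(1, max_iters + 1):
        grad = sum(-2.0 * (y - w * x) * x for x, y in train) / len(train)
        w -= eta * grad
        err = mean_sq_err(val, w)
        if err < best_err:
            best_w, best_t, best_err = w, t, err    # remember the best point so far
        elif t - best_t >= patience:
            break                                   # no improvement for `patience` steps
    return best_w, best_t, best_err

# hypothetical data: training targets are shifted, validation targets follow y = x
train_set = [(1.0, 1.5), (2.0, 2.5)]
val_set = [(1.0, 1.0), (2.0, 2.0)]
best_w, best_t, best_err = fit_with_early_stopping(train_set, val_set)
print(best_w, best_t, best_err)
```

In this toy setup the weight that minimizes training error differs from the one that minimizes validation error, so the best validation error is reached in the middle of training rather than at the end.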


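Fun Time

For the weight-elimination regularizer $\sum \frac{\left(w_{ij}^{(\ell)}\right)^2}{1 + \left(w_{ij}^{(\ell)}\right)^2}$, what is $\frac{\partial\, \mathrm{regularizer}}{\partial w_{ij}^{(\ell)}}$?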
Reference Answer: $\frac{2 w_{ij}^{(\ell)}}{\left(1 + \left(w_{ij}^{(\ell)}\right)^2\right)^2}$
Too much calculus in this class, huh? :-)
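For a single weight $w$, the quotient rule gives $\frac{d}{dw} \frac{w^2}{1 + w^2} = \frac{2w}{(1 + w^2)^2}$. The sketch below checks this against a finite-difference estimate and compares it with the L2 gradient $2w$; the function names are hypothetical.

```python
def weight_elim(w):
    """Weight-elimination penalty for a single weight: w^2 / (1 + w^2)."""
    return w * w / (1.0 + w * w)

def weight_elim_grad(w):
    """Derivative from the quotient rule: 2w / (1 + w^2)^2."""
    return 2.0 * w / (1.0 + w * w) ** 2

def l2_grad(w):
    """Derivative of the L2 penalty w^2 for a single weight."""
    return 2.0 * w

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of f'(w)."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

for w in (0.01, 0.5, 1.0, 5.0, 50.0):
    print(f"w={w:6.2f}  L2 grad={l2_grad(w):8.3f}  "
          f"weight-elim grad={weight_elim_grad(w):.6f}  "
          f"numeric={numeric_grad(weight_elim, w):.6f}")
```

For small weights the two gradients nearly agree, while for large weights the weight-elimination gradient stays bounded instead of growing with $w$.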
Summary

(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models
(3) Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
• Motivation
  multi-layer for power with biological inspirations
• Neural Network Hypothesis
  layered pattern extraction until linear hypothesis
• Neural Network Learning
  backprop to compute gradient efficiently
• Optimization and Regularization
  tricks on initialization, regularizer, early stopping

• next: making neural network ‘deeper’





